Introduction to SDMs

zoontutorials team

6 February 2017

Why do we fit species distribution models?

In ecology, we often want to understand where species are distributed in the environment and why. Sometimes we ask these questions to gain more knowledge about a species, yet other times we simply need to know where it is to better manage it.

What is a species distribution model?

Where we find species in geographic space is the result of three factors: the abiotic environment, the biotic environment, and the species’ movement. Together, these three factors determine the species’ ecological niche. For example, a species may be constrained by the climate required for individuals to survive, the local vegetation types it depends on, and its ability to traverse different landscapes. The intersection of the abiotic and biotic environmental conditions constitutes the fundamental niche, or where the species could physiologically occur given what we know of its biology and ecology. However, experience tells us that species are not always located where we expect them to be. Their realised niche is where they actually exist within the bounds of the environmental space where they could potentially exist.

Species distribution models estimate a species’ realised niche. They do this by estimating the correlation between environmental covariates and species occurrence data, which gives the probability of occurrence of the species along environmental gradients. For example, using SDMs, we can estimate the probability that a species of bird will use forest habitat with high tree cover compared to open habitat with low tree cover. We can then take these statistical relationships and project them onto geographic space, which allows us to visualise on a map how the probability of species occurrence varies across a landscape.
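
As a minimal sketch of this idea, the base R code below simulates presence/absence records along a single tree-cover gradient and fits a logistic regression to recover the relationship. The data, the variable names (tree_cover, presence), and the assumed "true" coefficients are invented for illustration; they are not part of zoon or of the Carolina Wren dataset.

```r
set.seed(42)

# Hypothetical covariate: percentage tree cover at 500 survey sites
tree_cover <- runif(500, min = 0, max = 100)

# Assumed "true" relationship: occurrence probability rises with tree cover
true_prob <- plogis(-3 + 0.06 * tree_cover)
presence <- rbinom(500, size = 1, prob = true_prob)

# Fit a logistic regression of occurrence on the covariate
fit <- glm(presence ~ tree_cover, family = binomial)

# Predicted probability of occurrence in open (10%) and forested (90%) habitat
new_sites <- data.frame(tree_cover = c(10, 90))
pred <- predict(fit, newdata = new_sites, type = "response")
names(pred) <- c("open", "forest")
print(round(pred, 2))
```

The fitted model should assign a higher probability to the forested site than to the open one, which is the kind of statistical relationship an SDM later projects onto a map.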

We have visualised this theory in Figure 1 below. This dataset uses presence data of the Carolina Wren as well as a suite of environmental predictors such as cover of different forest types. In the top left corner we have mapped onto geographic space the records of species occurrence (black dots) as well as the background points we will use in our model. In the top right we see these same points plotted in environmental space against two covariates. We can see that the presence points are clustered at higher percentages of deciduous forest and lower percentages of mixed forest. The model uses these presence-background data to construct predictions of species occurrence in environmental space, as seen in the bottom right plot. In the final plot, bottom left, we see these probability predictions mapped back onto geographic space in a way that allows us to understand where, and because of which drivers, the species is likely to occur. There are many types of species distribution models and the field is rapidly expanding.

*Figure 1. Species Distribution Model Theory. Presence-background points plotted on geographic space, on environmental space, and probability of occurrence predictions plotted on environmental and geographic space.*

How do I fit a species distribution model with zoon?

zoon is concerned with correlative SDMs, as described above, which are the most widely used type of SDM. This tutorial will focus on distribution models that estimate probability of occurrence. However, we could also use species abundance data to estimate how abundance varies along environmental gradients.

We implement a species distribution model in five steps:

  1. Occurrence: We first gather and format our species occurrence data.
  2. Covariates: Next, we gather and format the environmental covariates we believe are important to our species of interest.
  3. Process: Often our data, both occurrence and environmental, will need some pre-processing before we fit our models.
  4. Model: We then fit one or more statistical models to our data to estimate species probability of occurrence.
  5. Output: Finally, after fitting our model, we produce model outputs, such as graphs and maps, to enable us to make ecological inference about the species.

The zoon workflow is structured around these five steps and designed to make building and fitting SDMs straightforward and reproducible. The primary zoon function is workflow(), which we use to fit the SDM. The workflow() function has five arguments, one for each step in the SDM fitting process. For each argument, we need only select a ‘module’. The modules we choose in each step determine what type of model we run, with what data, and what outputs we produce.

This tutorial will guide you through the process of selecting a module for each argument of the workflow() function. Along the way, we’ll introduce some key factors that you should consider when fitting and evaluating an SDM.

But first, let’s fit a quick and simple SDM with zoon as a means of introduction. Don’t forget to load the package!

library(zoon)

A basic workflow for the Carolina Wren in the contiguous USA could look like this:

zoon_workflow <- workflow(occurrence = CarolinaWrenPO,
                          covariate = CarolinaWrenRasters,
                          process = Background(100),
                          model = MaxEnt,
                          output = InteractiveMap)

In this workflow, we have obtained occurrence records for the Carolina Wren from a zoon module, loaded environmental covariates from another zoon module, selected 100 background points (more on this later), fit a MaxEnt model, and generated our results as an interactive map. There are multiple ways to visualise the output of a model and we will cover them in a later section.

Throughout this tutorial we will guide you step-by-step through each of the arguments in a workflow as we update this zoon_workflow. Be sure to explore the other tutorials…

zoon comes with several pre-existing modules for each argument, and we will go through a selection of these modules in more detail in this tutorial. You might like to explore some different combinations yourself, and you can find more information by running the GetModuleList() command.

The five zoon workflow steps

Step 1. Occurrence

Species distribution models are fitted with species occurrence data, the most common of which are: presence-only, presence-background, and presence-absence. Less commonly, species distribution models can be fitted with abundance data.

Presence-only data is widely and freely accessible online and as a result is more commonly used than presence-absence data. As such, we write this tutorial for presence-only data and note any necessary alterations for presence-absence data.

The first module required in a workflow() is occurrence and it is where we load our species occurrence data. There are three methods for loading data into our workflow depending on where the data is sourced: we can use pre-existing occurrence modules in zoon like Lorem_ipsum_UK, download data from an online repository using SpOcc, or load in our own data from a local computer using LocalOccurrenceData.

zoon has several functions available to view the contents of each module in the workflow. For example, we can view our species occurrence data using the Occurrence() accessor function (using the head() function to see only the first six lines):

head(Occurrence(zoon_workflow))
##   longitude latitude value     type fold
## 1 -87.50607 34.92022     1 presence    1
## 2 -87.78094 34.93984     1 presence    1
## 3 -88.05596 34.95882     1 presence    1
## 4 -86.73598 34.41256     1 presence    1
## 5 -86.70923 34.63505     1 presence    1
## 6 -86.98288 34.65655     1 presence    1

The first two columns provide the geographic location of the observation (as longitude and latitude), the third column is the observation value (1 = presence, 0 = absence), the fourth column is the type of observation, and the last column identifies the “fold” (folds are covered later in the section of Process modules, but a default model can be considered to always have one fold).
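
To make the format concrete, here is a small base R sketch that builds a data frame with the same five columns by hand. The coordinates are made up and the object name my_occurrence is hypothetical; the sketch only mirrors the structure printed above and is not how zoon constructs the table internally.

```r
# Hypothetical records: two presences and one absence
my_occurrence <- data.frame(
  longitude = c(-87.51, -86.74, -90.10),
  latitude  = c(34.92, 34.41, 40.25),
  value     = c(1, 1, 0),                        # 1 = presence, 0 = absence
  type      = c("presence", "presence", "absence"),
  fold      = 1,                                 # a single fold by default
  stringsAsFactors = FALSE
)

# Check that the value column agrees with the type column
consistent <- all((my_occurrence$value == 1) == (my_occurrence$type == "presence"))
print(my_occurrence)
```

A quick consistency check like this can be useful before handing locally stored records to an occurrence module.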

Step 2. Covariates

The second module required in a workflow is for environmental data. As with occurrence modules, there are lots of different modules to choose from. There are zoon modules that contain data, such as CarolinaWrenRasters. There are modules that source data from online sources such as Bioclim. And we can also load our own data from a local computer using LocalRaster.

Covariates use the rasterStack data format to store data and the Covariate() function extracts the covariates from the workflow.

Covariate(zoon_workflow)
## class       : RasterBrick 
## dimensions  : 165, 297, 49005, 6  (nrow, ncol, ncell, nlayers)
## resolution  : 0.29, 0.223  (x, y)
## extent      : -138.7131, -52.58307, 18.1523, 54.9473  (xmin, xmax, ymin, ymax)
## coord. ref. : +init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       :      pcMix,      pcDec,      pcCon,       pcGr,        Lat,        Lon 
## min values  : -0.2816289, -0.4023483, -0.4118219, -0.4333224, -1.5588675, -1.3447938 
## max values  :  10.949743,   5.785296,   5.671915,   5.118029,   1.505573,   1.255225

Using what we’ve learned here, let’s update zoon_workflow so that we can build an SDM for the Carolina Wren in the USA using data from online repositories rather than a zoon data module. We keep all other aspects of the workflow the same.

zoon_workflow <- ChangeWorkflow(workflow = zoon_workflow,
                                occurrence = SpOcc("Thryothorus ludovicianus", 
                                                   extent = c(-138.71, -52.58, 18.15, 54.95)),
                                covariate = Bioclim(extent = c(-138.71, -52.58, 18.15, 54.95)))

Step 3. Process

Now that we have loaded in our species and environmental data using the occurrence and covariate modules, the process modules will perform any pre-processing of our data required before fitting the model itself. This is where we can modify our raw data by removing poor data points using the Clean module or standardising our covariates using StandardiseCov, generate the background data points required for presence-only data analysis using Background, add interaction terms between our covariates using addInteractions, or set up model validation methods using modules like CrossValidate.

Let’s check out our environmental data by using the function Covariate(). This function returns a summary of our raster data. In this case, we have five covariates stored in raster layers which are themselves stored in a raster brick. Our covariates are the Bioclim variables bio1, bio2, bio3, bio4, and bio5.

Covariate(zoon_workflow)   # Before standardisation
## class       : RasterBrick 
## dimensions  : 221, 517, 114257, 5  (nrow, ncol, ncell, nlayers)
## resolution  : 0.1666667, 0.1666667  (x, y)
## extent      : -138.6667, -52.5, 18.16667, 55  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## data source : in memory
## names       :  bio1,  bio2,  bio3,  bio4,  bio5 
## min values  :   -57,    32,    18,   893,   135 
## max values  :   287,   211,    78, 14175,   448

Since we are using presence-only data in our model, we need to generate some background data, also known as pseudo-absences. Depending on the type of model (see next section), these data are used either to sample the range of environmental conditions in the landscape for comparison with where the species has been found, or as a non-presence class of data. We also want to standardise our covariates. To use multiple modules in a single argument we can use the Chain() function. Here we use StandardiseCov to standardise our covariates and Background to generate 1000 background points (increased from 100 in the earlier example).

zoon_workflow <- ChangeWorkflow(workflow = zoon_workflow,
                                process = Chain(StandardiseCov, Background(n = 1000)))
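
The two processing steps can also be illustrated outside zoon with base R. The sketch below assumes that standardising means centring each covariate on zero and scaling it to unit standard deviation, and that background points are drawn uniformly at random within the study extent; the actual modules may differ in detail (for example, in how they treat cells with missing values). All object names and the simulated covariate values are hypothetical.

```r
set.seed(1)

# Hypothetical raw covariate values for 200 grid cells
covs <- data.frame(bio1 = rnorm(200, mean = 120, sd = 60),
                   bio4 = rnorm(200, mean = 7000, sd = 2500))

# Centre each column on 0 and scale it to unit standard deviation
covs_std <- as.data.frame(scale(covs))

# Draw 1000 background points uniformly within the study extent
# (xmin, xmax, ymin, ymax), matching the extent used in this tutorial
ext <- c(-138.71, -52.58, 18.15, 54.95)
background <- data.frame(longitude = runif(1000, ext[1], ext[2]),
                         latitude  = runif(1000, ext[3], ext[4]),
                         value = 0, type = "background")

print(round(colMeans(covs_std), 3))
print(head(background, 3))
```

Standardising puts covariates measured in very different units on a comparable scale, which generally makes model coefficients easier to compare.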

In some instances we may not need a process module in our workflow. However, process is a mandatory argument, so the NoProcess module can be used as a ‘blank’ module.

Step 4. Model

A fundamental aspect of the workflow is the model module. This is where we choose which type of SDM we want to fit to our data. There are many SDM methods to choose from and each has its merits (Elith et al., 2006). A few of the more common examples include the LogisticRegression, mgcv, MaxEnt, RandomForest, and GBM modules. For more detail on these methods you can refer to the “Choosing A Model” vignette.

Now, let’s update our workflow with another common model type, logistic regression. We do this by using the aptly named LogisticRegression module.

zoon_workflow <- ChangeWorkflow(workflow = zoon_workflow,
                                model = LogisticRegression)

Step 5. Output

Once we’re happy with our included data and the model type we’ve chosen, and we’ve included all relevant processes, it is finally time to check our results.

Let’s start by mapping our probability of occurrence predictions onto geographic space. We can use the InteractiveMap module to create a map showing the predicted species distribution with the presence points overlaid.

zoon_workflow <- ChangeWorkflow(workflow = zoon_workflow,
                                output = InteractiveMap)

The InteractiveMap module produces a map of occurrence probability that we can interact with. The map shows a scale from light green to purple and overlays our raw data on top of the map so we can see how our data align with model predictions. We can click on our raw data to get information about it.

The InteractiveMap shows us that, according to the data we included, the Carolina Wren is most likely to occur in the southeast of the country (higher probability of occurrence: purple) and is less likely to occur in the northwest of the country (lower probability of occurrence: yellow). These quantitative predictions fit with a qualitative analysis of our data: most of our presence records (white dots) are in the southeast of the map.

Now that we’ve seen how Carolina Wren occurrence varies over geographic space, let’s look more closely at what is driving that variation. That is, how does probability of occurrence vary along the environmental gradients we included in our model? For this, we can use the ResponsePlot module to make graphs of the predicted relationships between probability of occurrence and our environmental covariates.

Without input arguments, ResponsePlot plots a graph for each covariate included in our model. We can also specify which covariate we want to plot.

zoon_workflow <- ChangeWorkflow(zoon_workflow,
                                output = ResponsePlot())
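
Conceptually, a response plot shows how predicted probability changes along one covariate. The base R sketch below illustrates one common way to compute such a curve: fit a model, vary one covariate across its observed range, and hold the other at its mean. The simulated data and the names temp and rain are hypothetical, and this is not necessarily how the ResponsePlot module computes its curves.

```r
set.seed(7)

# Simulated standardised covariates and presence/absence records
n <- 400
dat <- data.frame(temp = rnorm(n), rain = rnorm(n))
dat$presence <- rbinom(n, 1, plogis(-0.5 + 1.2 * dat$temp - 0.4 * dat$rain))

fit <- glm(presence ~ temp + rain, family = binomial, data = dat)

# Vary temp across its observed range while holding rain at its mean
grid <- data.frame(temp = seq(min(dat$temp), max(dat$temp), length.out = 50),
                   rain = mean(dat$rain))
grid$prob <- predict(fit, newdata = grid, type = "response")

# Draw the response curve for temp
plot(grid$temp, grid$prob, type = "l",
     xlab = "temp (standardised)", ylab = "Probability of occurrence")
```

Holding the other covariates fixed isolates the effect of the covariate of interest, at the cost of hiding any interactions between covariates.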

Putting it all together

Now that we have seen how to use each argument in the workflow() function, it is time to put all of the pieces together and fit our own SDM within zoon. As MaxEnt is arguably one of the most popular SDM methodologies, let’s fit a MaxEnt model in addition to the logistic regression. This is achieved by passing both modules to the model argument inside list(). We will fit these models to the Carolina Wren data obtained from GBIF and the Bioclim environmental variables, generate 1000 background samples, and display our results in a map without our data points displayed. We have covered each of the necessary modules previously, and here we will run them together to form a complete workflow().

Dual_Model_Workflow <- workflow(occurrence = SpOcc("Thryothorus ludovicianus", 
                                                   extent = c(-138.71, -52.58, 18.15, 54.95)),
                                covariate = Bioclim(extent = c(-138.71, -52.58, 18.15, 54.95)),
                                process = Background(1000),
                                model = list(LogisticRegression, MaxEnt),
                                output = PrintMap(points = FALSE))

There are some obvious differences in the predicted distribution maps of these two models. Because we used a single workflow, each model was fitted to the same data and produced the same type of output, yet the predicted distribution of the species differs between the models. What causes the difference? Check out our more detailed guide on SDM algorithm selection in the “Choosing A Model” vignette.

Conclusion

We’ve used the zoon package to successfully run and interpret an easily reproducible and shareable species distribution model. We loaded our occurrence and covariate data for our species of interest, pre-processed that data as required, ran our model, and produced some outputs. The interpretation of those outputs has increased our knowledge of the Carolina Wren: we’ve learned that the Carolina Wren is most likely to occur in the southeast USA. We’ve also learned that the wren does not particularly mind the relative cover of forest types, having a similar probability of occurring in each.